In this notebook we tried to analyze the bet odds and their implied probabilities. This is our first approach to predict outcomes of EPL matches. We thought that bet odds are reliable since bet companies constantly analyze every aspect of teams, including newest events and statuses of players and all related variables. So they probably perform different kinds of ML and statistical approaches to determine their odds. So we decided to analyze their odds and try to find outcome probabilities based on their probability. We tried to find minimum RPS among the bet companies. Since the first rounds of submissions were early, we were not ready to construct our ML models. In the meanwhile we did not want to fall behind in competition, so decided to use this approach for our initial submissions. Also we kept analyzing probabilities with this approach to make a backup in case of having ābadā outcomes from ML models.
library(data.table)
We read finished matches from the data (we update data just before every submission round). We also try to determine the outcome as -1,0,1 corresponding to away win, draw, home win by looking match scores.
# Read frome the snapshot, select only columns we're interested in.
matches <- fread(paste0(getwd(), "/data/matches.csv"), select=c("match_id", "match_hometeam_name", "match_awayteam_name", "league_id","match_status","match_hometeam_score", "match_awayteam_score", "epoch"))
# get only finished english premier league matches
finished_epl_matches <- matches[league_id == 148 & match_status == "Finished"]
finished_epl_matches[, outcome := (ifelse(match_hometeam_score == match_awayteam_score, 0 , ifelse(match_hometeam_score > match_awayteam_score, 1 , -1)))]
finished_epl_matches
We read bet data (again we update them just before round submission). We also have a ābet_extraā datatable which is something we tried to construct from online data. Some matchesā bet data were missing in given csv file(on google drive). So we had to find these bet data on our own. We used www.betexplorer.com to gather missing data. Each missing match we gathered data from the webpage and put them into our bet_extra.csv file. Then we merge two files ābets.csvā, ābets_extra.csvā.
# read bets csv
bets <- fread(paste0(getwd(), "/data/bets.csv"))
# add column to indicate if match ended as draw
# get only bets data with win, draw or lose.
bets <- bets[variable %in% c("odd_1", "odd_x", "odd_2" )]
bets_extra <- fread(paste0(getwd(), "/data/bets_extra.csv"))
names(bets_extra)[1] <- "match_id"
names(bets_extra)[2] <- "odd_bookmakers"
names(bets_extra)[3] <- "odd_1"
names(bets_extra)[4] <- "odd_x"
names(bets_extra)[5] <- "odd_2"
modified_time = file.info(paste0(getwd(), "/data/bets.csv"))$mtime
bets_extra = melt(bets_extra, id.vars = c("match_id", "odd_bookmakers"),
measure.vars = c("odd_1", "odd_x", "odd_2"))
bets_extra[, odd_epoch := as.numeric(modified_time)]
bets <- rbind(bets, bets_extra)
bets = bets[order(match_id)]
bets
We merge bet and finished epl matches.
# join bets and match data by match_id
merged_data <- merge(finished_epl_matches, bets, by='match_id')
merged_data
We created 3 functions to find RPS on table. RPS_single was for test purposes to, we actually donāt use it. We use RPS_matrix in this anaylsis. It takes probabilities(away,draw,win) and their corresponding outcome set. Then finds an RPS value. RPS_model is constructed for our ML model.
RPS_single<- function(probs,outcomes){
probs = cumsum(probs)
outcomes = cumsum(outcomes)
RPS = sum((probs-outcomes )^2) / (length(probs)-1)
return(RPS)
}
RPS_matrix<- function(probs,outcomes){
probs=as.matrix(probs)
outcomes=as.matrix(outcomes)
probs=t(apply(t(probs), 2, cumsum))
outcomes=t(apply(t(outcomes), 2, cumsum))
RPS = apply((probs-outcomes)^2,1,sum) / (ncol(probs)-1)
return(RPS)
}
RPS_model<- function(home,draw,away,outcome){
probs = cbind(home,draw,away)
outcome = as.integer(outcome)
outcome_h = as.integer(outcome == 1)
outcome_d = as.integer(outcome == 2)
outcome_a = as.integer(outcome == 3)
outcomes = cbind(outcome_h, outcome_d,outcome_a)
return(RPS_matrix(probs,outcomes))
}
bookmaker_analysis takes bookmaker bet data and converts them to implied probabilities.
# Use this function to analyze for different bookmakers.
bookmaker_analysis <- function(bookmaker_name, data) {
# get the data with selected bookmaker
if(bookmaker_name == TRUE){
bookmaker_bets <- data
}
else{
bookmaker_bets <- data[odd_bookmakers == bookmaker_name]
}
# find probabilities from given odds
bookmaker_bets <- bookmaker_bets[, c("p_win", "p_draw", "p_lose") := list(1/value[variable == 'odd_1'], 1/value[variable == 'odd_x'], 1/value[variable == 'odd_2']), by = list(match_id, odd_bookmakers, variable)]
# group data by match_id and odd_bookmakers, so that we can have only one row for each match and bookmaker
bookmaker_bets <- aggregate(bookmaker_bets[,c("p_win","p_draw","p_lose")], by=list(bookmaker_bets$match_id, bookmaker_bets$odd_bookmakers), FUN = function (x) first(na.omit(x)))
# convert it back to data table
bookmaker_bets <- data.table(bookmaker_bets)
# rename grouped columns
names(bookmaker_bets)[1] <- "match_id"
names(bookmaker_bets)[2] <- "odd_bookmakers"
# convert probabilities by implied probability function for each probability type
bookmaker_bets[, p_win_imp:=p_win / (p_win + p_draw + p_lose)]
bookmaker_bets[, p_draw_imp:=p_draw / (p_win + p_draw + p_lose)]
bookmaker_bets[, p_lose_imp:=p_lose / (p_win + p_draw + p_lose)]
}
We try to analyze past matches with their RPS values. There can be many matches with same ID belongs to different bookmaker, since each bookmaker gives different implied probability to each matches.
past_match_predictions <- bookmaker_analysis(TRUE, merged_data)
past_match_predictions <- past_match_predictions[, -c("p_win","p_draw","p_lose")]
setDT(past_match_predictions, key = "match_id")[finished_epl_matches, outcome := i.outcome]
setDT(past_match_predictions, key = "match_id")[finished_epl_matches, epoch := i.epoch]
past_match_predictions
This one creates a different matrix for outcomes in a form that our RPS functions expect. Then we find RPS value of each finished match according to their outcomes and their implied probability.
past_match_predictions[,pred_id:=1:.N]
outcome_for_rps=data.table::dcast(past_match_predictions,pred_id~outcome,value.var='pred_id')
outcome_for_rps[,pred_id:=NULL]
outcome_for_rps[is.na(outcome_for_rps)]=0
outcome_for_rps[outcome_for_rps>0]=1
overall_results <- past_match_predictions[,rps := RPS_matrix(past_match_predictions[,list(p_lose_imp,p_draw_imp,p_win_imp)],outcome_for_rps)]
overall_results[,pred_id:=NULL]
overall_results <- data.table(overall_results)
overall_results
Here we find each bookmakerās average RPS. We tried to construct a āleaderboardā for bookmakers where theyāre sorted by their mean RPS values. So we can select best bookmakers from this table. There are some bookmakers did really good job in terms of RPS but we found that they did not predict many matches. So we did not manually select bookmaker names from this table.
bookmaker_rps <- aggregate(overall_results[, rps], list(overall_results$odd_bookmakers), mean)
names(bookmaker_rps)[1] <- "odd_bookmakers"
names(bookmaker_rps)[2] <- "rps_mean"
bookmaker_rps <- data.table(bookmaker_rps)
bookmaker_rps <- bookmaker_rps[order(rps_mean)]
bookmaker_rps
mean(bookmaker_rps$rps_mean)
## [1] 0.2276873
Selecting āthe bestā bookmaker would not be the wisest approach since depending only one bookmaker is not a great opinion. Instead we tried to take average of ābest n bookmakerā implied probabilities. In order to do that we had to find this ānā. At the beginning we manually set this n as 3 (n=3). However then we realized that we can take it further and find this n from the data. We ran this code once to find optimal n. As our findings, having n=5 is the optimal choice. Since this code breaks the other parts and we did run it only once, we commented it out.
# Don't run this with other codes. This is a test code for to find optimal n (5 currently).
for (n in seq(0,60,1)) {
setDT(past_match_predictions, key = "odd_bookmakers")[bookmaker_rps, bookmaker_rps_ave := i.rps_mean]
first_n_bookmakers = past_match_predictions[order(match_id, bookmaker_rps_ave)]
first_n_bookmakers = first_n_bookmakers[, .SD[1:n], by=match_id]
submission = first_n_bookmakers[, lapply(.SD, mean), .SDcols = average_cols, by=match_id]
past_match_predictions = past_match_predictions[, .SD, by=match_id]
setDT(submission, key = "match_id")[past_match_predictions, outcome := i.outcome]
submission[,pred_id:=1:.N]
outcome_for_rps=data.table::dcast(submission,pred_id~outcome,value.var='pred_id')
outcome_for_rps[,pred_id:=NULL]
outcome_for_rps[is.na(outcome_for_rps)]=0
outcome_for_rps[outcome_for_rps>0]=1
overall_results <- submission[,rps := RPS_matrix(submission[,list(p_lose_imp,p_draw_imp,p_win_imp)],outcome_for_rps)]
overall_results[,pred_id:=NULL]
overall_results <- data.table(overall_results)
overall_results = overall_results[c(-1,-2)]
cat("n= ",n,"mean= ",mean(overall_results$rps, na.rm=TRUE),"\n")
}
We find upcoming matches by looking at their epochs.
unplayed_epl_matches <- matches[league_id == 148 & is.na(match_status) & epoch < as.integer(as.POSIXct("2019-12-30"))]
unplayed_epl_matches <- unplayed_epl_matches[,-c("match_status", "match_hometeam_score", "match_awayteam_score", "league_id")]
unplayed_epl_matches
We also convert bet odds of upcoming matches to implied probabilities. We also put bookmakerās mean RPS value in the table.
unplayed_merged_data <- merge(unplayed_epl_matches, bets, by='match_id')
unplayed_predictions <- bookmaker_analysis(TRUE, unplayed_merged_data)
unplayed_predictions <- unplayed_predictions[, -c("p_win","p_draw","p_lose")]
unplayed_predictions <- merge(unplayed_epl_matches, unplayed_predictions, by='match_id')
setDT(unplayed_predictions, key = "odd_bookmakers")[bookmaker_rps, bookmaker_rps_ave := i.rps_mean]
unplayed_predictions
We take average of implied probabilities just to see them.
average_cols = c("p_win_imp","p_draw_imp","p_lose_imp")
average_predictions = unplayed_predictions[, lapply(.SD, mean), .SDcols = average_cols, by=match_id]
average_predictions = merge(unplayed_epl_matches, average_predictions, by='match_id')
average_predictions
Here we find best ānā bookmakers and take average of their implied probabilities. Concatinate them to form our submission text.
n = 5
first_n_bookmakers = unplayed_predictions[order(match_id, bookmaker_rps_ave)]
first_n_bookmakers = first_n_bookmakers[, .SD[1:n], by=match_id]
submission = first_n_bookmakers[, lapply(.SD, mean), .SDcols = average_cols, by=match_id]
submission_txt = paste(t(submission[, c("match_id", "p_win_imp", "p_draw_imp", "p_lose_imp")]), collapse=',')
submission_txt
## [1] "273291,0.320750168385935,0.255694206555901,0.423555625058164,273292,0.514999959326059,0.256387278343203,0.228612762330738,273293,0.245522815911084,0.265106226946046,0.48937095714287,273294,0.724441681576651,0.178925550853189,0.0966327675701606,273295,0.813337601450492,0.125833146727795,0.0608292518217134,273296,0.294848365795222,0.283994975257029,0.421156658947749,273297,0.181391106086424,0.221807943079548,0.596800950834028,273298,0.474215762202999,0.275686182129787,0.250098055667214,273299,0.485795365926461,0.251914734354526,0.262289899719013,273300,0.376835456695125,0.267121062849755,0.35604348045512"
We can find our submission RPS value by providing outcome set.
submission$outcome = c(-1, -1, 1, 1,0,0,0,1,1,0)
submission[,pred_id:=1:.N]
outcome_for_rps=data.table::dcast(submission,pred_id~outcome,value.var='pred_id')
outcome_for_rps[,pred_id:=NULL]
outcome_for_rps[is.na(outcome_for_rps)]=0
outcome_for_rps[outcome_for_rps>0]=1
overall_results <- submission[,rps := RPS_matrix(submission[,list(p_lose_imp,p_draw_imp,p_win_imp)],outcome_for_rps)]
overall_results[,pred_id:=NULL]
overall_results <- data.table(overall_results)
overall_results